My team covers a range of policy issues in the city:
--> Check out https://controller.phila.gov/policy-analysis
Goal: exploring and extracting insight from complex datasets
Course design
Note: all students will complete an introductory survey, which will ask about preferred time slots and the possibility of moving lectures (lectures will only move if a time that works for everyone can be found)
Course has four websites (sorry!). They are:
Each will have its own purpose:
Exploratory Data Science: Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python
Introduction to Geospatial Data Science: Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.
Data Ingestion & Big Data: Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.
Geospatial Data Science in the Wild: Armed with the necessary data science tools, students will be introduced to a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.
From Exploration to Storytelling: The final module will teach students to present their analysis results using web-based formats to transform their insights into interactive stories.
Homework will be assigned (roughly) every two weeks. You must complete five of the seven homework assignments: four are required, and you may choose the fifth from the remaining three options.
The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.
Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience.
More info will be posted here: https://github.com/MUSA-550-Fall-2020/final-project
Very versatile: good for both exploratory data analysis and polished finished products
Note: as a free service, it can be a bit slow sometimes
To follow along today, go to https://github.com/MUSA-550-Fall-2020/week-1
These slides are a Jupyter notebook.
A mix of code cells and text cells in Markdown. Change the type of cell in the top menu bar.
# Comments begin with a "#" character in Python
# A simple code cell
# SHIFT-ENTER to execute
x = 10
print(x)
# integer
a = 10
# float
b = 10.5
# string
c = "this is a test string"
# lists
d = list(range(10))
# booleans
e = True
# dictionaries
f = {'key1': 1, "key2": 2}
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
f = dict(key1=1, key2=2, key3=3)
# access the value with key 'key1'
f['key1']
# access the second list entry (0 is the first index)
d[1]
# the first character
c[0]
# Python code
result = 0
for i in range(100):
    result = result + i
print(result)
a = range(10) # this is a lazy sequence, not yet a list
print(a)
# convert it to a list explicitly
a = list(range(10))
print(a)
# or use the inline list comprehension syntax; this gives the SAME result
a = [i for i in range(10)]
print(a)
def function_name(arg1, arg2, arg3):
    # code lines (indented)
    ...
    return result
def compute_square(x):
    return x * x
sq = compute_square(5)
print(sq)
def compute_product(x, y=5):
    return x * y
# use the default value for y
print(compute_product(5))
# specify a y value other than the default
print(compute_product(5, 10))
# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(x=5, y=2))
# argument names must match the function signature though!
# this raises a TypeError, since "z" is not an argument of compute_product
print(compute_product(5, z=5))
Use tab auto-completion and the ? and ?? operators
this_variable_has_a_long_name = 5
# try hitting tab after typing this_
#this_
# try typing "r" and then tab
# r
# Forget how to create a range? --> use the help message
range?
Use the ?? operator
compute_product??
The question mark operator gives you access to the help message for any variable or function. I use this frequently; it is my primary way of figuring out what a function does.
This was a very brief introduction. Additional Python tutorials are listed on our course website under "Resources"
https://musa-550-fall-2020.github.io/resources/python/
Recommended tutorial for students with little Python background:
There are also a few good resources from the Berkeley Data Science Institute:
The Python Data Science Handbook is a free, online textbook covering the Python basics needed in this course. In particular, the first four chapters are excellent:
Note that you can click on the "Open in Colab" button for each chapter and run the examples interactively using Google Colab.
In this class, we will almost exclusively work inside Jupyter notebooks — you'll be writing Python code and doing data analysis directly in the notebook.
The more traditional way to use Python is to put your code into a .py file and execute it from the command line (the Anaconda Prompt on Windows, or the Terminal app on macOS).
See this section of the Practical Python Programming tutorial for more info.
There is a file called hello_world.py in the repository for week 1. If we execute it, it should print out "Hello, World" to the command line.
Let's try it out.
You can run terminal commands directly in the Jupyter notebook's "code" cell by starting the line with a "!"
To list all of the files in the current folder (the "current working directory"), use the ls command:
! ls
We see the hello_world.py file listed. Now let's execute it on the command line by using the python command:
! python hello_world.py
Success!
When writing software outside the notebook, it's useful to have an application known as a "code editor". This will provide a nice interface for writing Python code and some even have fancy features, like real-time syntax checking and syntax highlighting.
My recommended option is Visual Studio Code.
The following line imports the pandas package:
import pandas as pd
The primary objects in pandas are the:
- DataFrame, which is very similar to an Excel spreadsheet: it has rows and named columns
- Series, which represents a single column of data. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in R.
You can think of Series objects as fancier versions of Python's built-in list data type
To create a Series object:
# city names and population
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
city_names
DataFrame objects can be created by passing a dict mapping string column names to their respective Series.
cities_df = pd.DataFrame({ 'City Name': city_names, 'Population': population })
cities_df
Note: always try to give your variables meaningful names. It will help immensely when you are trying to debug problems or when you're trying to understand the code you've written months later.
# access columns with a dict-like syntax
cities_df['Population']
# list slicing: get the elements with indices 1 and 2 (but NOT 3)
cities_list = ['San Francisco', 'San Jose', 'Sacramento']
cities_list[1:3]
# slicing data frame rows is very similar!
cities_df[1:3]
pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.
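For instance, here is a minimal sketch of three common selection styles, re-creating the small cities_df from above: label-based selection with .loc, position-based selection with .iloc, and boolean masking:

```python
import pandas as pd

# re-create the small example data frame
cities_df = pd.DataFrame({
    "City Name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
})

# label-based: rows with index labels 0 through 1 (inclusive!)
by_label = cities_df.loc[0:1]

# position-based: the first row, regardless of its index label
by_position = cities_df.iloc[0]

# boolean masking: rows where the condition is True
big = cities_df[cities_df["Population"] > 500000]
```

Note that .loc slicing is inclusive of its endpoint, unlike ordinary Python list slicing.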
cities_df['Population'].median()
NumPy is a popular toolkit for scientific computing.
pandas Series can be used as arguments to most NumPy functions:
import numpy as np
# calculate the median population value
np.median(cities_df['Population'])
For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.
For example, we can find which cities have a population greater than a million:
# define our function
def get_large_cities(population):
    return population > 1e6
large_cities_sel = cities_df['Population'].apply(get_large_cities)
large_cities_sel
# add the new computed column to our original data frame
cities_df['Large Cities'] = large_cities_sel
# we can also use lambda (unnamed, inline) functions
cities_df['Small Cities'] = cities_df['Population'].apply(lambda population: population < 1e6)
# print out
cities_df
We can select the "large" cities by passing the boolean values to the .loc[] indexer of a DataFrame:
cities_df['Large Cities']
cities_df.loc[cities_df['Large Cities']]
cities_df['Population'] > 1e6
# this is equivalent to doing
cities_df.loc[cities_df['Population'] > 1e6]
We can use the pandas query function.
The query() function will return a subset of your dataframe based on a string version of the boolean expression.
cities_df.query("Population > 1e6")
The documentation is your best friend! Use the question mark operator!
# use the question mark
pd.DataFrame.loc?
Let's load census data on rental rates from the 2017 American Community Survey.
We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.
Use the read_csv() function from pandas. The first argument to read_csv() is the file path to load.
census_df = pd.read_csv("./data/census/ACS_17_1YR_B25003.csv")
census_df.head()
Select certain columns and rename them:
# columns we want
columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
census_df = census_df[columns].copy()
# rename columns
census_df.columns = ['City', 'Total', 'Rentals']
census_df.head()
# Add the rental rate
census_df['Rental Rate'] = census_df['Rentals'] / census_df['Total'] * 100
census_df.head()
len(census_df)
We can select the largest cities by population using the Series.isin() function, which checks if each value in the Series is in the specified list.
top_cities_list = [
"Philadelphia city, Pennsylvania",
"New York city, New York",
"Los Angeles city, California",
"Houston city, Texas",
"Chicago city, Illinois",
"Phoenix city, Arizona",
]
# select the rows where the City is in our list
census_df.loc[census_df["City"].isin(top_cities_list)]
Alternatively, we could use the query() function. The syntax is a bit different (and I always forget and have to look at the documentation).
In the query() string, you can reference existing variables by prefacing the variable name with an "@" symbol.
For example:
# Get the rows where the City column is in "top_cities_list"
census_df.query("City in @top_cities_list")
Python uses the "%" operator to insert variable values into strings. For example, the file name of the data we want to load for 2017 is:
year_str = "17"
path = "./data/census/ACS_%s_1YR_B25003.csv" % year_str
print(path)
See this guide on this type of string formatting for more info.
Python version 3.6 introduced a new and improved string formatting syntax, where the variable that is being inserted can be referenced directly using the variable name.
The variable name should be enclosed in curly braces inside the string:
year_str = "17"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path)
For more info: see this guide
Let's use a for loop to loop over each year value from 2005 to 2017
all_rental_data = []
for year in range(2005, 2018):

    # The variable year changes for each iteration of the loop
    print(f"Loading data for year = {year}...")

    # read data for this year
    year_str = str(year)  # convert integer value of "year" to a string
    year_str = year_str[2:]  # extract the last two digits of the year string
    rental_df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")

    # columns we want
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    rental_df = rental_df[columns]

    # rename columns
    rental_df.columns = ['City', 'Total', 'Rentals']

    # calculate the rental rate
    rental_df['Rental Rate'] = rental_df['Rentals'] / rental_df['Total'] * 100

    # select the cities we want
    selection = rental_df['City'].isin(top_cities_list)
    top_cities_df = rental_df.loc[selection].copy()

    # add a column for the year
    top_cities_df['Year'] = year

    # save
    all_rental_data.append(top_cities_df)
# combine all of the data frames into one along the row axis
all_rental_data = pd.concat(all_rental_data, axis=0)
all_rental_data.head()
# What about the tail?
all_rental_data.tail()
matplotlib
Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting library, matplotlib.
Import the library we'll use from matplotlib.
from matplotlib import pyplot as plt
We'll use the plt.plot() function which plots a simple x vs y line.
with plt.style.context('ggplot'):

    # Create a figure and axes to plot on
    fig, ax = plt.subplots(figsize=(10, 6))

    # Loop over each city and plot
    for city in all_rental_data['City'].unique():
        print(city)

        # Select data for this city
        this_city = all_rental_data['City'] == city
        city_df = all_rental_data.loc[this_city]

        # Plot Year vs Rental Rate
        ax.plot(city_df['Year'], city_df['Rental Rate'], label=city, linewidth=4)

    # Add a legend
    ax.legend(loc=0, ncol=3, fontsize=12)

    # Set the y limit
    ax.set_ylim(30, 72)

    # Add a y label
    ax.set_ylabel("Rental Rate in Percent");
Note: we used the "ggplot" theme to style our plots. Matplotlib comes with a number of built-in styles: see all of the examples here.
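You can also list the built-in style names yourself (a quick check, assuming matplotlib is importable in your environment):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside a notebook
from matplotlib import pyplot as plt

# print the names of all available built-in styles
print(plt.style.available)
```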
Exercise: can you calculate what the increase was for Philadelphia?
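One way to start: the philly dataframe used below is presumably obtained by selecting the Philadelphia rows out of all_rental_data. A minimal sketch (the data frame here is an illustrative stand-in with made-up values, not the real census numbers):

```python
import pandas as pd

# illustrative stand-in for all_rental_data (values are made up)
all_rental_data = pd.DataFrame({
    "City": ["Philadelphia city, Pennsylvania", "Houston city, Texas",
             "Philadelphia city, Pennsylvania"],
    "Rental Rate": [40.0, 50.0, 45.0],
    "Year": [2005, 2005, 2017],
})

# select only the Philadelphia rows
philly = all_rental_data.loc[
    all_rental_data["City"] == "Philadelphia city, Pennsylvania"
]
```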
Note how the first row of the above dataframe has an index label of 393. This means we can access it using the .loc[] function like:
philly.loc[393]
This is different from the .iloc[] function, which indexes based on the integer position of the row, regardless of the labels in the index. For example, the first row of a dataframe can always be accessed using:
philly.iloc[0]
We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. For example:
philly.reset_index(drop=True)
Next, to calculate our percent difference we need to get the values for Philadelphia in 2005 and 2017:
# select 2005
philly_2005 = philly.loc[philly['Year']==2005]
value_2005 = philly_2005['Rental Rate'].iloc[0]
print("2005 value = ", value_2005)
philly_2005['Rental Rate']
# select 2017
philly_2017 = philly.loc[philly['Year']==2017]
value_2017 = philly_2017['Rental Rate'].iloc[0]
print("2017 value = ", value_2017)
# calculate percent change
percent_change = (value_2017 - value_2005)/ value_2005 * 100
print("change in rental rate in Philadelphia from 2005 to 2017 = ", percent_change)
For more details on the .iloc[] function, see the documentation on indexing by position.
Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.
Files were downloaded from https://www.zillow.com/research/data/
home_values = pd.read_csv("./data/zillow/Metro_Zhvi_AllHomes.csv", engine='python')
rent_values = pd.read_csv('./data/zillow/Metro_Zri_SingleFamilyResidenceRental.csv', engine='python')
Peek at the first few rows of the ZRI data:
rent_values.head()
And do the same for the ZHVI data:
home_values.head()
valid_cities = [
"New York, NY",
"Chicago, IL",
"Los Angeles-Long Beach-Anaheim, CA",
"Philadelphia, PA",
"Houston, TX",
"Phoenix, AZ",
]
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
rent_values_trimmed
Unwanted columns can be dropped from the data frame using the drop() function.
unwanted = ['SizeRank', 'RegionID']
home_values_final = home_values_trimmed.drop(unwanted, axis=1)
unwanted = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(unwanted, axis=1)
rent_values_final
Currently, our data is in wide format $\rightarrow$ each observation has its own column. This usually results in many columns but few rows.
home_values_final
Usually it's better to have data in tidy (also known as long) format.
Tidy datasets are arranged such that each variable is a column and each observation is a row.
In our case, we want to have a column called ZRI and one called ZHVI and a row for each month that the indices were measured.
pandas provides the melt() function for converting from wide formats to tidy formats.
pd.melt?
Now, let's melt our datasets:
ZHVI = pd.melt(
home_values_final, id_vars=["RegionName"], value_name="ZHVI", var_name="Date"
)
ZRI = pd.melt(
rent_values_final, id_vars=["RegionName"], value_name="ZRI", var_name="Date"
)
and take a look:
ZRI.head()
ZHVI.head()
Note that missing data values are represented as NaN ("not a number")
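Missing values like these can be detected and handled directly with pandas. A minimal sketch (the numbers here are made up, not the Zillow data):

```python
import numpy as np
import pandas as pd

# a Series with one missing value
s = pd.Series([1200.0, np.nan, 1350.0])

# detect missing values (boolean mask)
mask = s.isna()

# drop the missing entries...
dropped = s.dropna()

# ...or fill them with a value of your choosing
filled = s.fillna(0.0)
```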
Another common operation is merging, also known as joining, two datasets.
We can use the merge() function to merge observations that have the same Date and RegionName values.
zillow_data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
zillow_data.head()
Merging is very powerful and the merge can be done in a number of ways. See the infographic on joining in this repository.
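For example, the how= argument controls which rows survive the merge. A small sketch with two made-up frames (not the Zillow data):

```python
import pandas as pd

left = pd.DataFrame({"Date": ["2019-01", "2019-02"], "ZRI": [1600, 1620]})
right = pd.DataFrame({"Date": ["2019-02", "2019-03"], "ZHVI": [230000, 232000]})

# inner join: keep only Dates present in BOTH frames
inner = pd.merge(left, right, on="Date", how="inner")

# outer join: keep ALL Dates, filling the gaps with NaN
outer = pd.merge(left, right, on="Date", how="outer")
```

Here the inner join keeps only "2019-02", while the outer join keeps all three dates, with NaN where one side has no data.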
Currently our Date column is stored as a string.
pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.
# convert the Date column to datetime objects
zillow_data['Date'] = pd.to_datetime(zillow_data['Date'])
Quick trick: Series that hold datetime objects have a dt attribute that lets you grab parts of the date easily.
For example, we can easily add new columns for the month and year using:
# Note that the dtype is now datetime64[ns]
zillow_data['Date'].head()
zillow_data['Month'] = zillow_data['Date'].dt.month
zillow_data['Year'] = zillow_data['Date'].dt.year
zillow_data.head()
pandas is especially useful for grouping and aggregating data via the groupby() function.

From the pandas documentation, groupby refers to a process involving one or more of the following steps: splitting the data into groups, applying a function to each group, and combining the results.

The documentation is available here.
We can calculate annual averages for each year by grouping by the RegionName and Year columns and taking the mean of our desired column. For example:
# calculate mean values for each Year and City (RegionName)
annual_ZHVI = zillow_data.groupby(['RegionName', 'Year'])['ZHVI'].mean()
annual_ZRI = zillow_data.groupby(['RegionName', 'Year'])['ZRI'].mean()
print(type(annual_ZHVI))
annual_ZHVI.head()
Note here that the result is indexed by the columns we grouped by (RegionName and Year).
We can reset the index so that the index values are listed as columns in the data frame again.
annual_ZHVI = annual_ZHVI.reset_index()
annual_ZRI = annual_ZRI.reset_index()
annual_ZHVI.head(n=50)
with plt.style.context("ggplot"):

    # Create figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot for each unique city
    for city in annual_ZHVI["RegionName"].unique():

        # select the data for this city
        selection = annual_ZHVI["RegionName"] == city
        df = annual_ZHVI.loc[selection]

        # plot
        ax.plot(df["Year"], df["ZHVI"] / 1e3, label=city, linewidth=4)

    # Format the axes
    ax.set_ylim(50, 800)
    ax.legend(loc=0, ncol=2, fontsize=12)
    ax.set_ylabel("Zillow Home Value Index\n(in thousands of dollars)")
with plt.style.context('ggplot'):

    # Create the figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))

    # Loop over the cities to plot each one
    for city in annual_ZRI['RegionName'].unique():

        # Select the city data
        selection = annual_ZRI['RegionName'] == city
        df = annual_ZRI.loc[selection]

        # Plot
        ax.plot(df['Year'], df['ZRI'], label=city, linewidth=4)

    # Format
    ax.set_ylim(1000, 3300)
    ax.legend(loc=0, ncol=2)
    ax.set_ylabel('Zillow Rent Index (in dollars)')
Available on GitHub:
Two parts:
Due date: next Thursday 9/10 by the start of class (6pm)